population: contains total population counts for each country per year. murder: contains total number of estimated deaths from interpersonal violence for each country per year. happiness: contains happiness score (converted to 0 to 100 scale to be in terms of percentage) for each country per year.
Using these datasets, the variables of interest that we will use to answer our questions are:
Explanatory:Murder rate per 100k people
Response:Happiness Score (converted to 0 to 100 scale to be in terms of percentage)
1.1 Hypothesized Relationship
Before any analysis is done, we anticipate a negative association between average happiness scores and murder rates. It’s plausible that higher murder rates would coincide with a worse perception of security, resulting in lower happiness rates. There could also be factors such as government instability or even organized crime, which may lower happiness as well as increase murder rates.
1.2 Data Cleaning
In order to find the relationship between murder rates and happiness score for each country per year, we need to merge the total murders data set and population data set. That way, we can get the murder rate per 100K people.
Note
We first have to convert all the values into numbers (i.e. fix cases such as 1.1k to be 1100).
After cleaning the total murders and population data set, we can proceed to merging them to get a data set of the murder rate per 100k people for each country and year. We can then use pivot longer to transform the happiness score data set and merge it with the murder rate per 100k data set to get our final data set.
The final data set contains 1,820 rows and 6 columns.
The columns are country, year, murder_count, population, murder_rate_per_100k, and happiness_score.
It provides a comprehensive overview of the murder rates and happiness scores across various countries and years. Each entry in the data set corresponds to a unique combination of a country and a year ranging from 2005 to 2019.
2 Linear Regression
In this part, we will use linear regression to model the relationship between our two quantitative variables, murder rate (log) and happiness score. We want to see if our hypothesis from above holds true and whether there is a negative association between the two variables.
2.1 Data Visualization
TODO: edit this, it reads like chatgpt
Here, we will be create two data visualizations that explore the relationship between our two quantitative variables. Through creative visualizations, we aim to gain a deeper understanding of the relationship between the variables and uncover any interesting patterns or trends that may emerge over time and across countries.
2.1.1 Relationship between Murder Rate and Happiness Score Over Time
Throughout the years of 2005 to 2019, there seems to be a consistent negative association between average happiness and murder rate (log).
2.1.2 Relationship Between Murder Rate and Happiness Score
Code
country_murder_happiness <- murder_happiness |>group_by(country) |>summarise(avg_murder_rate =mean(murder_rate_per_100k),avg_happiness_score =mean(happiness_score))country_murder_happiness |>ggplot(aes(x =log(avg_murder_rate), y = avg_happiness_score) ) +geom_point(color ="steelblue") +geom_smooth(method ="lm", color ="black") +labs(title ="Relationship Between Murder Rate and Happiness Score (2005-2019)",subtitle ="Average Happiness Score",x ="Average Murder Rate (per 100k, log scale)", y ="") +theme_bw()
This plot shows the relationship between the average murder rate per 100k for each country and their overall happiness score. As with the previous plot, there is a negative association between the overall average murder rate and happiness.
2.2 Modeling
TODO: Mention conditions being met? maybe in a note block
We will be using linear regression as a statistical method to model the relationship between murder rate and happiness score. Then we will use this to evaluate the model fit.
x (explanatory): average murder rate per 100K people y (response): average happiness score
The linear regression model suggests that for each each unit increase in the natural logarithm of the average murder rate per 100,000 people, the predicted happiness score decreases by approximately -2.81 points.
The explained variation is 0.088. This means that murder rate explains about 9 percent of the variability in happiness in our model.
3 Simulation
In this section, we aim to evaluate the performance of our linear regression model by comparing the observed data to simulated data generated using the model’s predictions. By simulating data in this manner, we can assess how well our model captures the underlying relationship between murder rate and happiness.
3.1 Visualizing Simulated Data
The comparison between the observed and simulated data will be visualized through side-by-side plots showcasing the relationships modeled by the linear regression for both data sets.
Both the simulated and observed data show clustering at lower murder rates, suggesting a possible non-linear relationship or threshold effect not captured by the linear prediction. However, the simulated data seems to have higher predicted happiness scores at lower murder rates. This discrepancy implies that a linear model provides a simplified representation of the relationship between variables, but it may not accurately capture all the variability and intricacies present in the data set. Therefore, relying solely on a linear model might lead to incomplete or inaccurate conclusions about the underlying phenomena being studied.
3.2 Full Scale Simulation
Code
set.seed(42)simulated_ys <-map(1:1000, ~ predictions +rnorm(n =length(predictions), mean =0, sd = residual_se))simulated_datasets <-map(simulated_ys, ~data.frame(avg_murder_rate =log(country_murder_happiness$avg_murder_rate), avg_happiness_score = .x))# Remove observations with missing valuessimulated_datasets <-map(simulated_datasets, na.omit)# Regress the observed dataset against each simulated dataset and extract R-squared valuesr_squared_values <-map_dbl(simulated_datasets, ~summary(lm(avg_happiness_score ~ avg_murder_rate, data = .x))$r.squared)# Plot distribution of R-squared valueshist_data <-data.frame(R_squared = r_squared_values)ggplot(hist_data, aes(x = R_squared)) +geom_histogram(fill ="steelblue", color ="black", bins =30) +labs(title =expression("Distribution of"~ R^2~"Values"),x =expression("Simulated"~ R^2),y ="",subtitle ="Number of Simulated Models")
TODO: Update!
The histogram shows a right-skewed distribution with a higher frequency on the left side and shorter bars as it moves to the right. We see that the simulated datasets have R-squared values between 0 and 0.075. This suggests that the simulated data shows less variability in happiness scores when comparing with the observed data. It is difficult to ascertain on average how much our simulated data account for the variability in the observed happiness score. But from the graph, one could estimate on average it accounts for <1.25% of the variability in the observed happiness score. The spread of the data also has a narrow spread with values concentrated close to 0, which indicates that the simulated data resembles the observed data.
---title: "STAT331 Final Project Report"author: "Thien An Tran, Tejasree Kandibanda, Matthew Huang, Chloe Anbarcioglu"format: html: embed-resources: true code-tools: true toc: true number-sections: trueeditor: sourceexecute: error: true echo: true message: false warning: falsecode-fold: truereferences:- type: website id: Gapminder URL: https://www.gapminder.org/data/ language: en-US---TODO:- update interpretations to log of average murder rate- add region colors to plots- conclusion# IntroductionTODO:- add more to introFor our project, we want to explore the relationship between murder rate and happiness score in each country per year```{r setup}#| include: falselibrary(tidyverse)library(here)library(gganimate)library(gifski)library(patchwork)library(purrr)murder <- read_csv("data/murder_total_deaths.csv")happiness <- read_csv("data/hapiscore_whr.csv")population <- read_csv("data/pop.csv")```We obtained our data from Gapminder @Gapminder.> **population**: contains total population counts for each country per year.\> **murder**: contains total number of estimated deaths from interpersonal violence for each country per year.\> **happiness**: contains happiness score (converted to 0 to 100 scale to be in terms of percentage) for each country per year.TODO:- Variables selected (define explanatory + response)- Explain why variables selectedUsing these datasets, the variables of interest that we will use to answer our questions are:**Explanatory:** ***Murder rate per 100k people*****Response:** ***Happiness Score***(converted to 0 to 100 scale to be in terms of percentage)## Hypothesized RelationshipBefore any analysis is done, we anticipate a negative association between average happiness scores and murder rates. It's plausible that higher murder rates would coincide with a worse perception of security, resulting in lower happiness rates.There could also be factors such as government instability or even organized crime, which may lower happiness as well as increase murder rates.## Data CleaningIn order to find the relationship between murder rates and happiness score for each country per year, we need to merge the total murders data set and population data set. That way, we can get the murder rate per 100K people.::: callout-noteWe first have to convert all the values into numbers (i.e. fix cases such as `1.1k` to be `1100`).:::```{r}#| output: falseconvert_value <-function(val) { val <-as.character(val) multiplier <-case_when(str_detect(val, "k") ~1e3,str_detect(val, "M") ~1e6,str_detect(val, "B") ~1e9,TRUE~1 ) numeric_value <-as.numeric(str_remove_all(val, "[kMB]"))return(numeric_value * multiplier)}murder_clean <- murder |>select(country, `2005`:`2019`) |>pivot_longer(cols =`2005`:`2019`,names_to ="year",values_to ="murder_count") |>mutate(across(murder_count, ~convert_value(.)))murder_cleanpopulation_clean <- population |>select(country, `2005`:`2019`) |>pivot_longer(cols =`2005`:`2019`,names_to ="year",values_to ="population") |>mutate(across(population, ~convert_value(.)))population_clean```After cleaning the total murders and population data set, we can proceed to merging them to get a data set of the murder rate per 100k people for each country and year. We can then use pivot longer to transform the happiness score data set and merge it with the murder rate per 100k data set to get our final data set.```{r}#| output: falsemurder_pop_merged <- murder_clean |>inner_join(population_clean, by =c("country", "year"))murder_rate_clean <- murder_pop_merged |>mutate(murder_rate_per_100k = (murder_count / population) *100000)happiness_clean <- happiness |>select(country, `2005`:`2019`) |>pivot_longer(cols =`2005`:`2019`,names_to ="year",values_to ="happiness_score") |>drop_na(happiness_score)murder_happiness <- murder_rate_clean |>inner_join(happiness_clean, by =c("country", "year"))murder_happiness |>head() |> knitr::kable(digits =4) ```The final data set contains 1,820 rows and 6 columns.The columns are `country`, `year`, `murder_count`, `population`, `murder_rate_per_100k`, and `happiness_score`.It provides a comprehensive overview of the murder rates and happiness scores across various countries and years. Each entry in the data set corresponds to a unique combination of a country and a year ranging from 2005 to 2019.# Linear RegressionIn this part, we will use linear regression to model the relationship between our two quantitative variables, murder rate (log) and happiness score. We want to see if our hypothesis from above holds true and whether there is a negative association between the two variables.## Data VisualizationTODO: edit this, it reads like chatgptHere, we will be create two data visualizations that explore the relationship between our two quantitative variables. Through creative visualizations, we aim to gain a deeper understanding of the relationship between the variables and uncover any interesting patterns or trends that may emerge over time and across countries.### Relationship between Murder Rate and Happiness Score Over Time```{r}animated_plot <-ggplot(murder_happiness,aes(x =log(murder_rate_per_100k),y = happiness_score)) +geom_point(color ="steelblue") +geom_smooth(method ="lm", color ="black") +labs(title ="Relationship Between Murder Rate and Happiness Score (2005-2019)",subtitle ="Average Happiness Score",x ="Average Murder Rate (per 100k, log scale)",y ="",caption ="Year: {frame_time}") +transition_time(as.integer(year)) +enter_fade() +exit_fade() +theme_bw() +theme(plot.caption =element_text(size =11))animate(animated_plot, renderer =gifski_renderer())```Throughout the years of 2005 to 2019, there seems to be a consistent negative association between average happiness and murder rate (log).### Relationship Between Murder Rate and Happiness Score```{r}country_murder_happiness <- murder_happiness |>group_by(country) |>summarise(avg_murder_rate =mean(murder_rate_per_100k),avg_happiness_score =mean(happiness_score))country_murder_happiness |>ggplot(aes(x =log(avg_murder_rate), y = avg_happiness_score) ) +geom_point(color ="steelblue") +geom_smooth(method ="lm", color ="black") +labs(title ="Relationship Between Murder Rate and Happiness Score (2005-2019)",subtitle ="Average Happiness Score",x ="Average Murder Rate (per 100k, log scale)", y ="") +theme_bw()```This plot shows the relationship between the average murder rate per 100k for each country and their overall happiness score. As with the previous plot, there is a negative association between the overall average murder rate and happiness.## ModelingTODO: Mention conditions being met? maybe in a note blockWe will be using linear regression as a statistical method to model the relationship between murder rate and happiness score. Then we will use this to evaluate the model fit.> **x (explanatory)**: average murder rate per 100K people\> **y (response)**: average happiness score```{r}linear_model <-lm(avg_happiness_score~log(avg_murder_rate), country_murder_happiness)broom::tidy(linear_model) |> knitr::kable(digits =2)```\begin{equation*}\text{Predicted Happiness Score} = 57.77 - 2.81 \times \ln(\text{Murder Rate})\end{equation*}The linear regression model suggests that for each each unit increase in the natural logarithm of the average murder rate per 100,000 people, the predicted happiness score decreases by approximately -2.81 points.## Model Fit```{r}var_response <-var(country_murder_happiness$avg_happiness_score)var_fitted <-var(linear_model$fitted.values)var_resid <-var(linear_model$residuals)explained_variation <- var_fitted/var_responsetable_data <-data.frame(Variable =c("Response Variable Variance", "Fitted Values Variance", "Residuals Variance", "Explained Variation (R^2)"),Value =c(var_response, var_fitted, var_resid, explained_variation))table_data |> knitr::kable(digits =4) ```The explained variation is 0.088. This means that murder rate explains about 9 percent of the variability in happiness in our model.# SimulationIn this section, we aim to evaluate the performance of our linear regression model by comparing the observed data to simulated data generated using the model's predictions. By simulating data in this manner, we can assess how well our model captures the underlying relationship between murder rate and happiness.## Visualizing Simulated DataThe comparison between the observed and simulated data will be visualized through side-by-side plots showcasing the relationships modeled by the linear regression for both data sets.```{r}set.seed(42)predictions <-predict(linear_model, country_murder_happiness)residual_se <-sigma(linear_model)simulated_y <- predictions +rnorm(n =length(predictions), mean =0, sd = residual_se)observed <-ggplot(country_murder_happiness, aes(x =log(avg_murder_rate), y = avg_happiness_score) ) +geom_point(color ="steelblue") +labs(title ="Observed Data",subtitle ="Observed Happiness Score",x ="Average Murder Rate", y ="") +theme_bw()# Plot Simulated Datapredicted <-ggplot(country_murder_happiness, aes(x =log(avg_murder_rate), y = simulated_y) ) +geom_point(color ="orange3") +labs(title ="Simulated Data",subtitle ="Simulated Happiness Score",x ="Average Murder Rate", y ="") +theme_bw()observed + predicted```TODO: update!Both the simulated and observed data show clustering at lower murder rates, suggesting a possible non-linear relationship or threshold effect not captured by the linear prediction. However, the simulated data seems to have higher predicted happiness scores at lower murder rates. This discrepancy implies that a linear model provides a simplified representation of the relationship between variables, but it may not accurately capture all the variability and intricacies present in the data set. Therefore, relying solely on a linear model might lead to incomplete or inaccurate conclusions about the underlying phenomena being studied.## Full Scale Simulation```{r}set.seed(42)simulated_ys <-map(1:1000, ~ predictions +rnorm(n =length(predictions), mean =0, sd = residual_se))simulated_datasets <-map(simulated_ys, ~data.frame(avg_murder_rate =log(country_murder_happiness$avg_murder_rate), avg_happiness_score = .x))# Remove observations with missing valuessimulated_datasets <-map(simulated_datasets, na.omit)# Regress the observed dataset against each simulated dataset and extract R-squared valuesr_squared_values <-map_dbl(simulated_datasets, ~summary(lm(avg_happiness_score ~ avg_murder_rate, data = .x))$r.squared)# Plot distribution of R-squared valueshist_data <-data.frame(R_squared = r_squared_values)ggplot(hist_data, aes(x = R_squared)) +geom_histogram(fill ="steelblue", color ="black", bins =30) +labs(title =expression("Distribution of"~ R^2~"Values"),x =expression("Simulated"~ R^2),y ="",subtitle ="Number of Simulated Models")```TODO: Update!The histogram shows a right-skewed distribution with a higher frequency on the left side and shorter bars as it moves to the right. We see that the simulated datasets have R-squared values between 0 and 0.075. This suggests that the simulated data shows less variability in happiness scores when comparing with the observed data. It is difficult to ascertain on average how much our simulated data account for the variability in the observed happiness score. But from the graph, one could estimate on average it accounts for <1.25% of the variability in the observed happiness score. The spread of the data also has a narrow spread with values concentrated close to 0, which indicates that the simulated data resembles the observed data.# ConclusionIn the end...